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Abstract 

Factorized information criterion (FIC) is a recently developed approximation 
technique for the marginal log-likelihood, which provides an automatic model se¬ 
lection framework for a few latent variable models (LVMs) with tractable infer¬ 
ence algorithms. This paper reconsiders FIC and fills theoretical gaps of previous 
FIC studies. First, we reveal the core idea of FIC that allows generalization for a 
broader class of LVMs, including continuous LVMs, in contrast to previous FICs, 
which are applicable only to binary LVMs. Second, we investigate the model selec¬ 
tion mechanism of the generalized FIC. Our analysis provides a formal justification 
of FIC as a model selection criterion for LVMs and also a systematic procedure for 
pruning redundant latent variables that have been removed heuristically in previ¬ 
ous studies. Third, we provide an interpretation of FIC as a variational free energy 
and uncover a few previously-unknown their relationships. A demonstrative study 
on Bayesian principal component analysis is provided and numerical experiments 
support our theoretical results. 

1 Introduction 

The marginal log-likelihood is a key concept of Bayesian model identification of la¬ 
tent variable models (LVMs), such as mixture models (MMs), probabilistic principal 
component analysis, and hidden Markov models (HMMs). Determination of dimen¬ 
sionality of latent variables is an essential task to uncover hidden structures behind 
the observed data as well as to mitigate overfitting. In general, LVMs are singular (i.e., 
mapping between parameters and probabilistic models is not one-to-one) and such clas¬ 
sical information criteria based on the regularity assumption as the Bayesian informa¬ 
tion criterion (BIC) [Schwarz, 1978] are no longer justified. Since exact evaluation of 
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the marginal log-likelihood is often not available, approximation techniques have been 
developed using sampling (i.e., Markov Chain Monte Carlo methods (MCMCs) [Hast¬ 
ings, 1970]), a variational lower bound (i.e., the variational Bayes methods (VB) [At- 
tias, 1999, Jordan et al., 1999]), or algebraic geometry (i.e., the widely applicable 
BIC (WBIC) [Watanabe, 2013]). However, model selection using these methods typi¬ 
cally requires heavy computational cost (e.g., a large number of MCMC sampling in a 
high-dimensional space, an outer loop for VBAVBIC.) 

In the last few years, a new approximation technique and an inference method, 
factorized information criterion (FIC) and factorized asymptotic Bayesian inference 
(FAB), have been developed for some binary LVMs [Fujimaki and Morinaga, 2012, 
Fujimaki and Hayashi, 2012, Hayashi and Fujimaki, 2013, Eto et al., 2014]. Unlike 
existing methods which evaluate approximated marginal log-likelihoods calculated for 
each latent variable dimensionality (and therefore need an outer loop for model selec¬ 
tion), FAB finds an effective dimensionality via an EM-style alternating optimization 
procedure. 

Eor example, let us consider a Ff-component MM for N observations = 
(xi,..., xat) with one-of-AT coding latent variables = (zi,..., zjv), mixing co¬ 
efficients P = (Pi,..., Pk), and Uh- dimensional component-wise parameters H = 
{€l; ■ • ■ ) ^k}- By using Laplaces method to the marginalization of the log-likelihood, 
EIC of MMs [Eujimaki and Morinaga, 2012] is derived by 
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maxE„ 

9 


lnp(X,Z I A 


K) 


2 N 


+ H{q)-^\nN, 


( 1 ) 


where q is the distribution of Z, /3 and H are the maximum joint-likelihood estima¬ 
tors (MJLEs)', Du — Ds + K — 1 h the total dimension of H and /3, and H{q) is 
the entropy of q. A key characteristic of EIC can be observed in the second term of 
Eq. (1), which gives the penalty in terms of model complexity. As we can see, the 
penalty term decreases when Znk —the number of effective samples of the /c-th 

component—is small, i.e., Z is degenerated. Therefore, through the optimization of q, 
the degenerate dimension is automatically pruned until a non-degenerated Z is found. 
This mechanism makes EAB a one-pass model selection algorithm and computation¬ 
ally more attractive than the other methods. The validity of the penalty term has been 
confirmed for other binary LVMs, e.g., HMMs [Eujimaki and Hayashi, 2012], latent 
feature models [Hayashi and Eujimaki, 2013], and mixture of experts [Eto et al., 2014]. 

Despite EAB’s practical success compared with BIC and VB, it is unclear that what 
conditions are actually necessary to guarantee that EAB yields the true latent variable 
dimensionality. In addition, the generalization of EIC for non-binary LVMs still re¬ 
mains an important open issue. In case that Z takes negative and/or continuous values, 
is no longer interpretable as the number of effective samples, and we loose the 
clue for finding the redundant dimension of Z. 

'Note that MILE is not equivalent to maximum a posteriori estimator (MAP). MJLE is given by 
argmax 0 p(X, Z|0) and the MAP is given by argmaxQ p(X, Z|0)p(©). 
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This paper proposes generalized FIC (gFIC), given by 

gFIC(i^) =E,. [£(Z, n, K)] + H{q), (2) 

£(2, n, K) = lnp(X, Z\n,K)-Un\Fs\-^ In TV. 

Here, q*{Z) = p(Z | X,£r) is the marginal posterior and is the Hessian matrix 
of — lnp(X, Z I n, K)/N with respect to H. In gFIC, the penalty term is given by 
the volume of the (empirical) Fisher information matrix. It naturally penalizes model 
complexity even when the latent variable Z takes negative and/or continuous values. 
Accordingly, gFIC is applicable to a broader class of LVMs, such as Bayesian principal 
component analysis (BPCA) [Bishop, 1998]. 

Furthermore, we prove that FAB automatically prunes redundant dimensionality 
along with optimizing q, and gFIC for the optimized q asymptotically converges to 
the marginal log-likelihood with a constant order error under some reasonable assump¬ 
tions. This justifies gFIC as a model selection criterion for LVMs and further a natural 
one-pass model “pruning” procedure is derived, which is performed heuristically in 
previous FIC studies. We also provide an interpretation of gFIC as a variational free 
energy and uncover a few previously-unknown their relationships. This interpretation 
gives formal conditions for justifying that model selection by the VB marginal log- 
likelihood. 

Finally, we demonstrate the validity of gFIC by applying it to BPCA. The experi¬ 
mental results agree with to the theoretical properties of gFIC. 


2 LVMs and Degeneration 


We first define the class of LVMs we deal with in this paper. Here, we consider LVMs 
that have X-dimensional latent variables z„ (including the MMs in the previous sec¬ 
tion), but now z„ can take not only binary but also real values. Given X and a model 
family (e.g., MMs), our goal is to determine K and we refer to this as a model. Note 
that we sometimes omit the notation K for the sake of brevity, if it is obvious from the 
context. 

The LVMs have £)=-dimensional local parameters H = ..., and D®- 

dimensional global parameters 0, which can include hyperparameters of the prior of Z. 
We abbreviate them as n = {0, H} and assume that the dimension Dyi = Dq+Ds is 
finite. Then, we define the joint probability: p(X, Z, IT) = p(X | Z,n)p(Z | n)p(n) 
where lnp(X, Z | 11) is twice differentiable at 11 G P and let Fn = 


f F® F®,h\ 
Fs J 



iiip(x,z I n) 

90 ds) ]\T 


Note that the MILE II = argmaxjj lnp(X, Z | 11) depends on Z (and X). In addition, 
lnp(X, Z I n) can have multiple maximizers, and II could be a set of solutions. 

Model redundancy is a notable property of LVMs. Because the latent variable Z 
is unobservable, the pair (Z, 11) is not necessarily determined uniquely for a given X. 
In other words, there could be pairs (Z, 11) and (Z, 11), whose likelihoods have the 
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same value, i.e., J3(X, Z | 11, K) = p(X, Z | 11, K). Previous FIC studies address this 
redundancy by introducing a variational representation that enables treating Z as fixed, 
as we explain in the next section. However, even if Z is fixed, the redundancy still 
remains, namely, the case in which Z is “degenerated,” and there exists an equivalent 
likelihood with a smaller model K' < K: 

p(x,z|n,x) = p(x,ZK'(3) 

In this case, K is overcomplete for Z, and Z lies on the subspace of the model K. As a 
simple example, let us consider a three-component MM for which Z = (z,l — z,0). In 
this case, ^3 is unidentifiable, because the third component is completely unused, and 
the K' = 2-component MM with Z 2 = (z, 1 — z) and 112 = (®, (€i)€ 2 )) satisfies 
equivalence relation (3). The notion of degeneration is defined formally as follows. 

Definition 1. Given X and K, Z is degenerated if there are multiple MJLEs and any 
Fpj of the MJLEs are not positive definite. Similarly, pfZ) is degenerated in dis¬ 
tribution, !/Ep[Fpj] are not positive definite. Let k(Z) = rank(Fpj) and K.{p) = 
rank(Ep[Fjj]). 

The idea of degeneration is conceptually understandable as an analogous of linear 
algebra. Namely, each component of a model is a “basis”, Z are “coordinates”, and 
k{Z) is the number of necessary components to represent X, i.e., the “rank” of X 
in terms of the model family. The degeneration of Z is then the same idea of the 
“degeneration” in linear algebra, i.e., the number of components is too many and 11 is 
not uniquely determined even if Z is fixed. 

As discussed later, given a degenerated Z where k{Z) = K', finding the equivalent 
parameters Zk' and Uk' that satisfy Eq. (3) is an important task. In order to analyze 
this, we assume Al): for any degenerated Z under a model K > 2 and K' < K, 
there exists a continuous onto mapping (Z, 11) —>■ {Zk', EIk') that satisfies Eq. (3), 
and Zk' is not degenerated. Note that, if P is a subspace of ^ 3 linear projection 
V : |_^ R^n^, satisfies Al where V is the top-I?nj^/ eigenvectors of Fn- This 

is verified easily by the fact that, by using the chain rule, Fff = VFpjV^, which is a 
diagonal matrix whose elements are positive eigenvalues. Therefore, Fff^^ is positive 
definite and Zk' is not degenerated. 

Let us further introduce a few assumptions required to show the asymptotic prop¬ 
erties of gEIC. Suppose A2) the joint distribution is mutually independent in sample- 
wise. 


p(x,z|n,x) = X{p{^n,^n |n,iT), (4) 

n 

and A3) lnp(n | K) is constant, i.e., limAr_>oo lnp(n | K)/N = 0. In addition, A4) 
p(n I K) is continuous, not improper, and its support V is compact and the whole 
space. Note that for almost all Z, we expect that II G P is uniquely determined and 
Ffj is positive definite, i.e., AS) if Z is not degenerated, then lnp(X, Z | 11, K) is 
concave and det |Fff | < 00 . 
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2.1 Examples of the LVM Class 

The above definition covers a broad class of LVMs. Here, we show that, as examples, 

MMs and BPCA are included in that class. Note that A2 does not allow correlation 
among samples and analysis of cases with sample correlation (e.g. time series models) 
remains as an open problem. 

MMs In the same notation used in Section 1, thejointlikelihoodis givenbyp(X, Z|n) = 

Y{nY{k{PkPk{y^n\ik)Y"''‘ where pk is the density of component k. If 

have no overlap, Fs is the block-diagonal matrix whose block is given by = 

— lnpfc(x„|^;.)z„fc/iV. This shows that the MM is degenerated, when more 

than one column of Z is filled by zero. For that case, removing such columns and cor¬ 
responding suffices as Zk' and Hk' in Al. Note that if pk is an exponential-family 
distribution exp(x^^j, — V’(€fc))> —VV lnpfc(x„|^^) = = C does not de¬ 

pend on n and gFIC recovers the original formulation of FICmm, i-C-, 5 In |F^- | = 

1 In |C(^^)| = % In + const. 

BPCA Suppose X G is centerized, i.e., = 0. Then, the joint likeli¬ 

hood ofXandZ G R^x^isgivenbyp(X,Z|n) = n„ Af(x„|Wz„, il)iV(z„|0,1), 
where H = W = (w.i,..., w.x) is a linear basis and © = A is the reciprocal of the 
noise variance. Note that the original study of BPCA [Bishop, 1998] introduces the 
additional priors p(W) = Af(wd| 0 , diag(a“^)) and p(A) = Gamma(A|aA, &a) 

and the hyperpriorp(Q:) = Gamma(afc Iuq-, &«)■ In this paper, however, we do not 
specify explicit forms of those priors but just treat them as 0 ( 1 ) term. 

Since there is no second-order interaction between and the Hessian Fh 

is a block-diagonal and each block is given by ^Z^Z. The penalty term is then given 
as 

-iln|Fs| = -|(XlnA-fln|lz^Z|), (5) 

and Z is degenerated, if rank(Z) < K. Suppose that Z is degenerated, let K' = 
rank(Z) < AT, and let the SVD be Z = Udiag((T)V^ = (Uk', 0)diag(crx', 0)(V/f/, 0)^, 
where \]k' and 'Vk' are K' non-zero singular vectors and ctk' is K' non-zero singu¬ 
lar values. From the definition of Fh, the projection V removes the degeneration of Z, 
i.e., by letting Z = ZV and W = WV, 

lnp(X,Z I n,X) = -^||X- ZW^IIi - i||Z||i + const. 

= _^||X-ZW^||2_i||Z||i +const. 

= lnp(X,ZK' I {\^k'},K'). 

where ||A||p = Xij denotes the Frobenius norm, Zk' = U_ft:'diag(cri^/), and 
'WK' = 'W'VK'- V transforms K — K' redundant components to 0-column vectors, 
and we can find the smaller model K' by removing the 0-column vectors from W and 
Z, which satisfies Al. 
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3 Derivation of gFIC 

To obtain p(X | K), we need to marginalize out two variables: Z and 11. Let us 
consider the variational form for Z, written as 

\np{X\K) =E,[lnp(X, Z\K)] + H{q) + KL(<z|k*) (6) 

=E,.[lnp(X,Z|iL)]+iJ(g*), (7) 

where KL(g||p) = J q{x) \n q{x)/p{x)dx is the Kullback-Leibler (KL) divergence. 

Variational representation (7) allows us to consider the cases of whether Z is de¬ 
generated or not separately. In particular, when Z ~ 7 *(Z) is not degenerated, then 
A5 guarantees that p(X, Z | K) is regular, and standard asymptotic results such as 
Laplace’s method are applicable. In contrast, if q*{Z) is degenerated, p(X, Z | K) 
becomes singular and its asymptotic behavior is unclear. 

In this section, we analyze the asymptotic behavior of the variational representa¬ 
tion (7) in both cases and show that gFIC is accurate even if q*{Z) is degenerated. Our 
main contribution is the following theorem.^ 

Theorem 2. Let K' = k(p(Z | X, K)). Then, 

lnp(X I iL) =gFIC(iT') + 0(l). (8) 

We emphasize that the above theorem holds even if the model family does not 
include the true distribution of X. To prove Theorem 2, we first investigate the asymp¬ 
totic behavior of lnp(X | K) for the non-degenerated case. 

3.1 Non-degenerated Cases 

Suppose TT is fixed, and consider the marginalizationp(X, Z) = / p(X, Z|n)p(n)dn. 
If p(Z|X) is not degenerated, then Z ~ p(Z|X) is not degenerated with probability 
one. This suffices to guarantee the regularity condition (AS) and hence to justify the 
application of Laplace’s method, which approximates p(X, Z) in an asymptotic man¬ 
ner [Tierney and Kadane, 1986]. 

Lemma 3. If Z is not degenerated, p(X, Z) = 

p(X,Z,ri)|Fn|-'/2(^^j (l + 0(7V-i)). (9) 

This result immediately yields the following relation: 

\np{X,Z)= C{Z,fl,K)+0{l). (10) 

Substitution of Eq. (10) into Eq. (7) yields Eq. (8).Note that we drop the 0(1) terms: 
Inp(n) (see A3), ln27r, and a term related to Fq to obtain Eq. (10). We em¬ 
phasize here that the magnitude of Fn (and F® and Fs) is constant by definition. 

formal proof is given in supplemental material. 
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Therefore, ignoring all of the information of Fpj in Eq. (9) just gives another 0(1) 
error and equivalence of gFIC ( 8 ) still holds. However, Fs contains important infor¬ 
mation of which component is effectively used to captures X. Therefore, we use the 
relation In |Fn| = In |Fs| + In |FH,nFn^Fj^nl and remain the first term in gFIC. In 
Section 4.3, we interpret the effect of Fh in more detail. 

3.2 Degenerated Cases 

If p(Z I X, K) is degenerated, then the regularity condition does not hold, and we 
cannot use Faplace’s method (Lemma 3) directly. In that case, however, A1 guarantees 
the existence of a variable transformation (Z, 11 ) — {Zk' , TIk' ) that replaces the joint 
likelihood by the equivalent yet smaller “regular” model: p(X, Z | K) = 

I p{x,z\n,K)p{n\K)dn 

= J p(X,Zk' I tlK',K')p{tlK' I K')dflK>. (11) 

Since Zk' is not degenerated in the model K', we can apply Laplace’s method and 
obtain asymptotic approximation (10) by replacing K by K'. Note that the transformed 
prior pITIk' | AT') would differ from the original prior p{IIk' \ K'). However, 
since the prior does not depend on N (A3), the difference is at most 0(1), which 
is asymptotically ignorable. 

Eq. (11) also gives us an asymptotic form of the marginal posterior. 

Proposition 4. 


p{Z\X,K)=pk{Z){1 + 0{N-^)), 

PKiZ) = { C K-K[Z), 

b«(z)(T«(z)(Z)) K>k{Z), 

where Tk' '■ Z —^ Zk' as Eq. (3) and C is the normalizing constant. 

The above proposition indicates that, if K.{p{Z \ X, K)) = K', p{Z \ X,K) 
is represented by the non-degenerated distribution p{Z \ X, K'). Now, we see that 
the joint likelihood (11) and the marginal posterior (12) depend on K' rather than K. 
Therefore, putting these results into variational bound (7) leads to (8), i.e., lnp(X | K) 
is represented by gFIC of the “true” model K'. 

Theorem 2 indicates that, if the model K is overcomplete for the true model K', 
lnp(X I K) takes the same value as lnp(X | K'). 

Corollary 5. For every K > K' = k{p{Z \ X)), 

lnp(X I X) =lnp(X I X') + 0(1). (14) 

This implication is fairly intuitive in the sense that if X concentrates on the sub¬ 
space of the model, then marginalization with respect to the parameters outside of the 


( 12 ) 

(13) 
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subspace contributes nothing to lnp(X | K). Corollary 5 also justifies model selec¬ 
tion of the LVMs on the basis of the marginal likelihood. According to Corollary 5, at 
fV —?► oo redundant models always take the same value of the marginal likelihood as 
that of the true model, and we can safely exclude them from model candidates. 

4 The gFAB Inference 

To evaluate gFIC (2), we need to solve several estimation problems. First, we need 
to estimate p(Z | X, K) to minimize the KL divergence in Eq. (6). In addition, since 
lnp(X I K) depends on the true model K' (Theorem 2), we need to check whether the 
current model is degenerated or not, and if it is degenerated, we need to estimate K'. 
This is paradoxical, because we would like to determine K' through model selection. 
However, by using the properties of gFIC, we can obtain K' efficiently by optimization. 

4.1 Computation of gFIC 

By applying Laplace’s method to Eq. (11) and substituting it into the variational form (6), 
we obtain lnp(X | K) = 

EJ/:(Z, n, Kiq))] + H{q) + KL(g||g*) + 0(1). (15) 

Since the KL divergence is non-negative, substituting this into Eq. (8) and ignoring the 
KL divergence gives a lower bound of gFIC(Ar'), i.e., 

gFIC(iT') > EJ£(Z, n, «(g))] + H{q). (16) 

This formulation allows us to estimate gFIC(Ar') via maximizing the lower bound. 
Moreover, we no longer need to know K' —if the initial dimension of q is greater than 
K', the maximum of lower bound (16) attains gFIC(A'') and thus lnp(X | K'). Sim¬ 
ilarly to other variational inference problems, this optimization is solved by iterative 
maximization of q and 11. 

4.1.1 Update of q 

As suggested in Eq. (15), the maximizer of lower bound (16) is p(Z | X) in which 
the asymptotic form is shown in Proposition 4. Unfortunately, we cannot use this 
as q, because the normalizing constant is intractable. One helpful tool is the mean- 
field approximation of q, i.e., q{Z) = J][^ g„(z„). Although the asymptotic marginal 
posterior (12) depends on n due to Fn, this dependency eventually vanishes for N 
oo, and the mean-field approximation still maintains the asymptotic consistency of 
gFIC. 

Proposition 6. Suppose p{Z\'K, K) is not degenerated in distribution. Then, p{Z\X.^ K) 
converges to p(Z|X, 11, AT), and p(Z|X, AT) is asymptotically mutually independent 
for'Ll, 



In some models, such as MMs [Fujimaki and Morinaga, 2012], the mean-field ap¬ 
proximation suffices to solve the variational problem. If it is still intractable, other 
approximations are necessary. For example, we restrict q as the Gaussian density 
qiZ) — 7V(z„|/x„, S„) for BPCA which we use in the experiments (Section 7). 


4.1.2 Update of n 

After obtaining q, we need to estimate II for each sample Z ~ 9 (Z), which is also in¬ 
tractable. Alternatively, we estimate the expected MILE II = argmaxjj Eq[lnp(X, Z | 
n)] . Since the max operator has convexity, Jensen’s inequality shows that replacing II 
by n introduces the following lower bound. 

E,[lnp(X,Z I n)] =EJmaxlnp(X,Z | H)] 

>E,[lnp(X,Z I n)] = maxEJlnp(X,Z I H)]. 

Since II depends only on q, we now need to compute the parameter only once. Re¬ 
markably, n is consistent with II and the above equality holds asymptotically. 

Proposition 7. If q(Z) is not degenerated in distribution, then n H. 

Since Eq[lnp(X, Z | 11)] is the average of the concave function (AS), Eg [lnp(X, Z | 
n)] itself is also concave and the estimation of II is relatively easy. If the expectations 
Eg[lnp(X, Z|n)] and Eg[Fn] are analytically written, then gradient-based optimiza¬ 
tion suffices for the estimation. If these is no analytic form, then stochastic optimiza¬ 
tion, such as stochastic gradient assuming Z ~ ^(Z) as a sample [Kingma and Welling, 
2013], might help. 

4.1.3 Model Pruning 

During the optimization of q, it can become degenerated or nearly degenerated. In 
such a case, by definition of objective (16), we need to change the form of £(Z, II, K) 
to £(Z,n, AT'). This can be accomplished by using the transformation (Z,n) — ^ 
(Zif/, TIk') and decreasing the current model from K to K', i.e., removing degener¬ 
ated components. We refer to this operation as “model pruning”. We practically verify 
the degeneration by the rank of Fs, i.e., we perform model pruning if the eigenvalues 
are less than some threshold. 

4.2 The gFAB Algorithm 

Algorithm 1 summarizes the above procedures, solving the following optimization 
problem: 

maxEg[£(Z, n(g), ^(g))] + H{q), (17) 

q^Q 

where Q = {q{Z) \ q{Z) = n„ <?n(zn)}. As shown in Propositions 6 and 7, the above 
objective is the lower bound of Eq. (16) and thus of gFIC(Ar'), and the equality holds 
asymptotically. 
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Algorithm 1 The gFAB algorithm 

Input: data X, initial model K, threshold 5 

repeat 

argmaXggQE,[£(Z,f[, «((?))] + H{q) 
if ctk(Fh) < • • • < cfK'i^'s.) < 5 then 
K ^ K' and (Z,n)^ (Zk/,!!^/) 

end if 

n 4- argmaxjj Eq[lnp(X, Z | 11, K)] 
until Convergence 

Corollary 8. 

{g¥lG{K')= Eq.{\l) forN 

[gFIC(A:') > -E?-(17) for a finite N >0. ^ 

The gFAB algorithm is the block coordinate ascent. Therefore, if the pruning 
threshold 6 is sufficiently small, each step monotonically increases objective (17), and 
the algorithm stops at critical points. 

A unique property of the gFAB algorithm is that it estimates the true model K' 
along with the updates of q and H. If N is sufficiently large and the initial model 
Kynax is larger than K', the algorithm learns pKfZf) as q, according to Proposition 4. 
At the same time, model pruning removes degenerated K—K' components. Therefore, 
if the solutions converge to the global optima, the gFAB algorithm returns K'. 

4.3 How Fn Works? 

Proposition 4 shows that if the model is not degenerated, objective (17) is maximized 
at g(Z) = p_ff(Z) oc p(Z|X, n)|Fp[|~^/^, which is the product of the unmarginal¬ 
ized posterior p(Z|X, n) and the gFIC penalty term |Fp[|“^/^. Since |Fp[|”^/^ has 
a peak where Z is degenerating, it changes the shape of p(Z|X, 11) and increases the 
probability that Z is degenerated. Figure 1 illustrates how the penalty term affects the 
posterior. 

Note that, if the model family contains the true distribution of X, then Fpj con¬ 
verges to the Fisher information matrix. From another viewpoint, Fpj is interpreted 
as the covariance matrix of the asymptotic posterior of 11. As a result of applying 
the Bernstein-von Mises theorem, the asymptotic normality holds for the posterior 
p(n|X, Z) in which the covariance is given by (A^Fpj)”^. 

Proposition 9. Let fl — '/N (n — n). Then, ifZ is not degenerated, |p(r2 | X, Z) — 
iV(0,E[Fn]-i)| 4 0. 

This interpretation has the following implication. In maximizing the variational 
lower bound (7), we maximize — ^ In |Fh|. In the gFAB algorithm, this is equivalent 
to maximize the posterior covariance and pruning the components where those covari¬ 
ance diverge to infinity. Divergence of the posterior covariance means that there is 
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Figure 1: The gFIC penalty |F^| changes the shape of the posterior p(Z | X, H) 
as increasing the probability of degenerated Z (indicated by diagonal stripes). 


insufficient information to determine those parameters, which are not necessary for the 
model and thus can reasonably be removed. 


5 Relationship with VB 

Similarly to FAB, VB alternatingly optimizes with respect to Z and 11, whereas VB 
treats both of them as distributions. Suppose K < K', i.e., the case when the posterior 
p(Z I X, K') is not degenerated in distribution. Then, the marginal log-likelihood is 
written by the variational lower bound: lnp(X | K') = 

E,(z,n) [lnp(X, Z, H | K')] + H{q{Z, H)) 

+ KL((?(z,n)|b(z,n|x,x')) 

>E,(z,n) [lnp(X, Z, U \ K')] + iT(g(Z, H)) 

>E,(z),(n) [lnp(X, Z, H | K')] + H{q{Z)) + H{q{U)), (19) 

where we use the mean-field approximation q{Z,U) = q{Jl)q{Z) in the last line. 
Minimizing the KL divergence yields the maximizers of Eq. (19), given as 

g(n) cx exp (E,(Z) [lnp(X, Z, U \ K')]) , (20) 

g(Z) cx exp (E,(n) [lnp(X, Z, H | K')]) . (21) 

Here, we look inside the optimal distributions to see the relationship with the 
gFAB algorithm. Let us consider to restrict the density (j(n) to be Gaussian. Since 
Eq(z) Z I n, K')] increases proportional to N while Inp(n) does not, g(n) 
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attains its maximum around H. Then, the second order expansion to In 9 ( 11 ) at n 
yields the solution 9 ( 11 ) = -/V(r[, (iVFn)“^). We remark that this solution can be 
seen as an empirical version of the asymptotic normal posterior given by Proposition 9. 
Then, if we further approximate lnp(X, Z | 11, K') by the second order expansion 
at n = n, the other expectation E^(n) [lnp(X, Z | 11, K')] appearing in Eq. (21) is 
evaluated by lnp(X, Z | f[, X')-iln|Fn|. Under these approximations, alternat¬ 
ing updates of {f[, and 9 (Z) coincide exactly with the gFAB algorithm^, which 
justifies the VB lower bound as an asymptotic expansion of p(X | K'). 

Proposition 10. Let Lvb{K) be the VB lower bound (19) with restricting 9 ( 11 ) to be 
Gaussian and approximating the expectation in In 9 (Z) by the second order expansion. 
ThenJorK < K', lnp(X | K) = Lyb{K) + 0(1^ 

Proposition 10 states that the VB approximation is asymptotically accurate as well 
as gFIC when the model is not degenerated. For the degenerated case, the asymp¬ 
totic behavior of Lyb{K) of general FVMs is unclear; however, a few specific mod¬ 
els such as Gaussian MMs [Watanabe and Watanabe, 2006] and reduced rank regres¬ 
sions [Watanabe, 2009] have been analyzed in both degenerated and non-degenerated 
cases. 

Proposition 10 also suggests that the mean-field approximation does not loose the 
consistency with p(X | K). As shown in Proposition 9, for K < K', the posterior 
covariance is given by (A^Fpj)^^, which goes to 0 for N ^ oo, i.e., the posterior con¬ 
verges to a point. Therefore, mutual dependence among Z and 11 eventually vanishes 
in the posterior, and the mean-field assumption holds asymptotically. This observation 
also allows further employment of the mean-field approximation to 9(11). For exam¬ 
ple, BPCA has two parameters 11 = {W, A} (see Section 2.1), in which the joint 
distribution 9 (W, A) has no analytical solution. However, the independence assump¬ 
tion 9 (W, A) = q{'W)q{X) gives us analytical solutions of 9 (W), 9 (A), and 9 (Z) 
under suitable conjugate priors. As discussed above, since both W and A converge to 
points, this approximation still maintains Proposition 10. 


^Note that model pruning is not necessary when K < K'. 
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EM 

BICEMT 

VB 

CVB 

EAB gEABT 

Objective 

Eq. (22) 


Eq. (19) 

Eq. (7) 

Eq. (16) 

n 

Point estimate 

Posterior w/ ME 

Marginalized out 

Laplace approximation 

<7(Z) 

= p(z|x,n) 

^P(Z|X) 

cxp(Z|X)(l + 0(l))t 

lnp(X|A: < K') 

0{\nN) 0(1)^ 

o(TF 

o(TF 

lnp{X\K > K') 

NA 

Generally NA 

W? 

Applicability 

Many models 

Many models 

Binary LVMs 

Binary LVMs LVMs 


Table 1: A comparison of approximated Bayesian methods. The symbol f highlights our contributions. “MF” stands for the mean-field 
approximation. Note that the asymptotic relations with lnp(X | K) hold only for LVMs. 














6 Related Work 

The EM Algorithm Algorithm 1 looks quite similar to the EM algorithm, solving 

maxEg[lnp(X | 11, AT)] + H{q). (22) 

g,n 

We see that both gFAB and EM algorithms iteratively update the posterior-like distri¬ 
bution of Z and estimate 11. The essential difference between them is that the EM 
algorithm infers the posterior p(Z|X, IT) in the E-step, but the gEAB algorithm infers 
the marginal posterior p(Z|X) ~ p(Z|X, As discussed in Section 4.3, 

the penalty term increases the probability mass of the posterior, where Z 

is degenerating, enabling automatic model determination through model pruning. In 
contrast, the EM algorithm lacks such pruning mechanism, and always overfits to X as 
long as N is finite while p(Z|X) eventually converges to p(Z|X, IT) for A —>■ oo (see 
Proposition 6). 

Note that Eq. (22) has C)(ln N) error against Inp(X). Analogously to gEIC, this er¬ 
ror is easily reduced to 0(1) by adding — In N. This modification provides another 
information criterion, which we refer to as BICEM. 

VB Methods The relationship between the VB and gFAB algorithms is discussed in 
the previous section. 

Collapsed VB (CVB) [Teh et al., 2006] is a variation of VB. Similarly to FAB, 
CVB takes the variational bound after marginalizing out 11 from the joint likelihood. 
In contrast to FAB, CVB approximates q in a non-asymptotic manner, such as the first- 
order Taylor expansion [Asuncion et al., 2009]. Although such approximation has been 
found to be accurate in practice, its asymptotic properties, such as consistency, have not 
been explored. Note that as one of those approximations, the mean-field assumption 
(/(Z) G Q is used in the original paper on CVB [Teh et al., 2006], motivated by the 
intuition that the dependence among {z„} is weak after marginalization. Proposition 6 
formally justifies this asymptotic independence assumption on the marginal distribu¬ 
tion employed in CVB. 

Several authors have studied about asymptotic behaviors of VB methods for LVMs. 
Wang and Titterington [2004] investigated the VB approximation for linear dynamical 
systems (a.k.a. Kalman filter) and showed the inconsistency of VB estimation with 
large observation noise. Watanabe and Watanabe [2006] derived an asymptotic varia¬ 
tional lower bound of the Gaussian MMs and demonstrated its usefulness for the model 
selection. Recently, Nakajima et al. [2014] analyzed the VB learning on latent Dirichlet 
allocation (EDA) [Blei et al., 2003], who revealed conditions for the consistency and 
clarified its transitional behavior of the parameter sparsity. By comparing with these 
existing works, we have a contribution in terms of that our asymptotic analysis is valid 
for general LVMs, rather than individual models. 

BIC and Extensions Let Y = (X, Z) be a pair of non-degenerated X and Z. By ig¬ 
noring all the constant terms of Laplace’s approximation (9), we obtain BIC [Schwarz, 
1978] considering Y as an observation, which is given by the right-hand side of the 
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following equation. 


lnp(Y I K) = \np{Y \ 11, K) - In TV + 0(1). 

Unfortunately, the above relation does not hold for p(X | K). Since p(X | K) = 
f p(Y I i^)dZ mixes up degenerated and non-degenerated cases, p(X | K) always 
becomes singular, loosing the condition AS that Laplace’s approximation holds. 

There are several studies that extend BIC to be able to deal with singular models. 
Watanabe [2009] evaluates p(X | K) with an 0(1) error for any singular models by 
using algebraic geometry. However, it requires an evaluation of the intractable rational 
number called the real log canonical threshold. Recent study [Watanabe, 2013] relaxes 
this intractable evaluation to the evaluation of criterion called WBIC at the expense of 
an Op(v^inJV) error. Yet, the evaluation of WBIC needs an expectation with respect to 
a practically intractable distribution, which usually incurs heavy computation. 


7 Numerical Experiments 

We compare the performance of model selection for BPCA explained in Section 2.1 
with the EM algorithm, BICEM introduced in Section 6, simple VB (VBl), full VB 
(VB2), and the gFAB algorithm. VB2 had the priors for W, A, and a described in 
Section 2.1 in which the hyperparameters were fixed as a\ = b\ = Ua = ba = 0.01 by 
following [Bishop, 1999]. VBl is a simple variant of VB2, which fixed a = 1. In this 
experiments. We used the synthetic data X = ZW^+E where W ^ uniform([0, !])“*, 
Z ~ A(0,I), and End ^ TV(0,ct^). Under the data dimensionality D = 30 and 
the true model K' = 10, we generated data with N = 100, 500,1000, and 2000. 
We stopped the algorithms if the relative error was less than 10“® or the number of 
iterations was greater than lO'^. 

Figure 2 depicts the objective functions after convergence for K = 2, , 30. Note 
that, we performed gFAB with K = 30 and it finally converged at AT ~ 10 owing to 
model pruning, which allowed us to skip the computation for K ~ 10,..., 30, and 
the objective values for those it's are not drawn. We see that gFAB underestimated the 
model when the number of samples were small {N < 500), but it successfully chose 
K = 10 with sufficiently large sample sizes (N > 500). In contrast, the objective 
of EM slightly but monotonically increased with K, which means EM always chose 
the largest K as the best model. This is because EM maximizes Eq. (22), which does 
not impose the penalty on the model complexity brought by the marginalization of 
n. As our analysis suggested in Section 6, BICEM and VBl are close to gFAB as N 
increasing and has a peak around K' = 10, meaning that BICEM and VBl are adequate 
for model selection. However, in contrast to gFAB, both of them need to compute for 
all K. Interestingly, VB2 were unstable for N < 2000 and it gave the inconsistent 
model selection results. We observed that VB2 had very strong dependence on the 
initial values. This behavior is understandable because VB2 has the additional prior 

^This setting could be unfair because VBl and VB2 assume the Gaussian prior for W. However, we 
confirmed that data generated by W ~ N{0, 1) gave almost the same results. 
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and hyperparameters to be estimated, which might produce additional local minima 
that make optimization difficult. 

8 Conclusion 

This paper provided an asymptotic analysis for the marginal log-likelihood of LVMs. 
As the main contribution, we proposed gFIC for model selection and showed its con¬ 
sistency with the marginal log-likelihood. Part of our analysis also provided insight 
into the EM and VB methods. Numerical experiments confirmed the validity of our 
analysis. 

We remark that gFIC is potentially applicable to many other LVMs, including factor 
analysis, EDA, canonical correlation analysis, and partial membership models. Inves¬ 
tigating the behavior of gFIC on these models is an important future research direction. 
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A Proofs 

Proof of Proposition 4. If Z is not degenerated, then Laplace’s method yields Eq. (10). 
By collecting from Eq. (10) the terms that depend on Z, we obtain 

p(Z I X, K) cx p(Z, X I n, 7T)|Fn|-'/2(l + 0{N-^)). (23) 

If p(Z I X,Ar) is degenerated, we consider the transformation (11). Here, the 
transformed prior | K') would differ from the original prior p{IIk' \ K'). 

However, since the mapping 11 —11;^/ is onto A1 and the prior is strictly positive 
in the whole space of 11 A4, p(n | K') is also strictly positive, including IIk' = 
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argmaxjj^, lnp(X, | YVk'tK'). Consequently, we can again use Laplace’s method 
for lnp(X, I IIk' , K'), and by collecting the terms that depend on Z, we obtain 

p(X I Z, X) cx p(X, Zk' I + O(iV-i)) (24) 

^Pk’{^k'.K'){1 + 0{N-^)). (25) 

This concludes the proof. □ 

Proof of Theorem 2. First, we prove the case that p(Z | X, iT) is not degenerated. 
In that case, Laplace’s approximation yields Eq. (10) in probability, and substituting 
Eq. (10) into (7) gives ( 8 ). 

If k{p{Z I X, iT)) = K' < K, Proposition 4 gives us that p{Z \ X,iT) = 
p/f'(Z)(l + 0{N~^)). Since 

Ep(z|x.tf) [lnp(X, Z I iT)] = [lnp(X, Z | iT)] + 0(1) 

and 

iJ(p(Z I X, K)) = (1 + 0{N-^))H{pk') + (1 + O(lV-i)) ln(l + 0{N-^)) 

= H{pk') + 0{1), 
lnp(X I K) is rewritten by 

Ep^, [lnp(X, Z\K)]+ H{pk') + 0(1) (26) 

=Ep^, [C{ZK’.tlK’,K')] + H{pk') + 0(1) (27) 

Here, since the projection T^' : Z —^ Zk' is continuous and onto (Al), we can 
describe pK' (Z) as the density of Zk' by using a change of variables, which we denote 
by Pk' (Z/f' )■ Now, we can rewrite the first term as the integral over Zk', i.e.. 


Ep^,[£(ZK,,flK,,7T0] = J C{TKfZ),UK>,K')pKfTK'iZ))dZ (28) 

= J C{ZK’,ii-K’,K')pK'{ZK’)dZK'- (29) 

Similarly, gFIC(Ar') is rewritten using Proposition 4 as 

gFIC(7T') = Ep^, [CiZK',flK',K')] + H{pk') + 0(1) (30) 

Again, the first term is written as 

¥.j,^,[C{ZK',flK',K')]= J C{ZK',flK',K')pK'{TK'{Z))dZK' (31) 

= J C{ZK',ilK',K')pK'{ZK')dZK' (32) 

Since Eq. (29) and (32) are the same, this concludes Eq. (8). 

□ 
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Proof of Proposition 6. Proposition 4 shows that, if Z is non-degenerated, 


p(Z|X,K)cxp(X,Z|ri)|Fn|-'/2 (33) 

cxnP(x„,z„ I ri)|Fn|-i/"'^ (34) 

n 

Since In |Fn| = 0(1), quickly diminishes to 1 for N —>■ oo. □ 

Proof of Proposition 7. For technical reasons, we redehne the estimators as follows: 

n = argmaxpAr(n) = argmax ^ lnp(X, Z|n), (35) 

n n 

n = argmaxGAr(n) = argmaxE 5 [-^ lnp(X, Z|n)]. (36) 

n n 1'' 

According to A5, (/jv(n) is continuous and concave, and it uniformly converges to 
Gw(n), i.e., 

sup | 5 jv(n)-Gjv(n)| 4o. (37) 

nep 

This suffices to show the consistency (for example, see Theorem 5.7 in [van der Vaart, 
1998].) □ 
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method j-*-! EM |~^| BICEM B VB1 □ VB2QgFAB 



Figure 2: The objective function versus the model K. The errorbar shows the standard 
deviations over 10 different random seeds, which affect both data and initial values of 
the algorithms. 
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